Porting Plan 9 to supercomputers

• Because it's a clean, small system
• Flexibility comes in at user level
• One question of the DOE work: can we remove OS bypass if the kernel is fast enough?
• Simulations said "maybe"
Simulation

• Using IBM SystemSim, boot Plan 9 and run a program that does a single write

  acid: 0x0119dd39 n = r;==>/9k/port/sysfile.c:790
  acid: 0x0119dd3a n = r;==>k/port/sysfile.c:790
  acid: 0x0119dd3b off = ~0LL;==>9k/port/sysfile.c:792
  acid: 0x0119dd3c off = ~0LL;==>9k/port/sysfile.c:792 etc.

• About 600 ticks
• About 180 lines
• Seemed like it would be quite fast
Context

as we want

• In simulation, we had thought the path from user to kernel to wire was fast
• Certainly faster than MPI libraries (or so the MPI guys told us)
• Measurement on real hardware showed it was actually much slower than the simulation suggested
barrier driver

	dcrput(p->set, 1); /* signal */

• That's all there is to it
• Across 128K CPUs, this op takes about 125 ns.
• Other networks are similar
• HPC approach: just let programs do it directly
• Our approach: go through a fast kernel
• But it was not fast enough: it took significantly longer
• What's an acceptable time? It has to be well under 1 microsecond
Where is the time going?

• We did not have a way to trace, function by function, where time was spent and who called whom
• We can do profiling, but that really gives only a "fraction of time spent"
• Hard to see relationships between events
• Profiling is a histogram tool
Where is the time going?
Related Work

We would rather see

• Who calls whom
• What fraction of time I spend in "x" before I call "y"
• Not just how much time is spent in "x" and "y"
• Need to see relationships and ordering of calls
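To make the contrast with a histogram concrete, here is a minimal C sketch of what an ordered stream of entry/exit events allows. The `Event` struct, the integer function IDs and the name `selftime` are hypothetical and are not devtrace's format. The sketch walks the events in order with an explicit stack and charges elapsed ticks to whichever function is on top, so the time "x" spends before it calls "y" is counted for "x" alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event record: not the actual devtrace format. */
typedef struct {
    char kind;            /* 'E' = function entry, 'X' = function exit */
    int fn;               /* small integer naming the function */
    unsigned long long t; /* timestamp in ticks */
} Event;

enum { MAXFN = 16, MAXDEPTH = 64 };

/*
 * Walk the events in order, keeping a stack of active functions.
 * Ticks between consecutive events are charged to the function on
 * top of the stack, giving each function's exclusive (self) time.
 * Returns 0 on success, -1 on malformed input.
 */
int
selftime(const Event *ev, size_t n, unsigned long long self[MAXFN])
{
    int stack[MAXDEPTH];
    int depth = 0;
    unsigned long long last = 0;
    size_t i;

    assert(ev != NULL || n == 0);
    for (i = 0; i < MAXFN; i++)
        self[i] = 0;
    for (i = 0; i < n; i++) {
        if (ev[i].fn < 0 || ev[i].fn >= MAXFN)
            return -1;
        if (depth > 0)
            self[stack[depth - 1]] += ev[i].t - last;
        last = ev[i].t;
        if (ev[i].kind == 'E') {
            if (depth == MAXDEPTH)
                return -1;
            stack[depth++] = ev[i].fn;
        } else if (ev[i].kind == 'X') {
            /* an exit must match the innermost active function */
            if (depth == 0 || stack[depth - 1] != ev[i].fn)
                return -1;
            depth--;
        } else
            return -1;
    }
    return 0;
}
```

A histogram profiler could report the same totals, but only the ordered events let the walk recover which function was active when, and therefore who called whom.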
Where is the time going?

• dtrace[1], dkm, neat hardware hacks[3], kprobes[2], djprobes, jprobes[4], kernel markers, ftrace [http://lwn.net/Articles/270971/]
• First time I saw it was on SunOS ca. 1988, which used a kernel-markers-like approach
• Kernel markers are a lot of work: they require annotating thousands of points to really get coverage
• My reading: the Linux community may find function tracing is "good enough" most of the time (see: ftrace)
• MacOS, however, has adopted dtrace, which is extremely powerful
• dtrace has two modes:
  • Enable always-compiled-in function traces, or
  • Rewrite the running kernel binary for more complex tracing
Where is the time going?

• The obvious one: overhead
  • Hence terms like "invasive" and "sample-based"
• The less obvious one: you just changed a kernel binary
  • Was that safe?
  • Do all the CPUs know?
  • When is it safe to change it back? (answer: maybe never)
• SMP issues
devtrace

Earlier Dynamic Kernel Modifier work (2001?)

• Code rewriting at runtime
• Modify code by moving blocks and replacing them with jumps
• Given an address, rewrite the code at that address to jump to a "logger"
• If you know entry and exit addresses, you can trace a function
• Gets a little tricky if you don't want to write an object-code-understander-relocater for CISC, and I don't
• I only moved code known to be "safe" to move, i.e. register-register moves etc.
• See paper for details
• GCC function prologues come in only a few specific types, so this was easy, and you only need to move 5 bytes, around instruction boundaries
• 8c is not so gentle
devtrace

[Figure: Rewritten code with entry, exit modified. Labels: jump to buffer; function body; jump to buffer; call user trigger code; function entry; jump to body; call user function; function exit.]
devtrace

• Relocated code has to be position-independent
• There is stack fixup:
  • Have to maintain the stack correctly so calls from this function work
  • Have to ensure that the function, on exit, returns to the jump to your exit code
  • i.e. we don't rewrite the exit code, we only rewrite the entry code
• It gets messy but it's doable
• And it's fun to disable gettimeofday() and watch how things slowly fall apart ...
devtrace

and Ron)

• Ron did an early cut based on the Dynamic Kernel Modifier work from 2001
• At IWP9 2, Aki adapted it to the PowerPC on Jim's desk
• We then took it further, over to Blue Gene/P
devtrace

• Short form: on PPC it was pretty high overhead (although the object-code-understander was not an issue)
• Worse, it required rewriting bits of the kernel memory image at run time
• Even worse, there is never a guarantee that you know when you can turn it off[4]
• Much less turn it on: are you sure that core 1 is not running the code while you are busy rewriting it?
• You can't ensure it by just making sure you write less than one cache line of code!
devtrace Implementation

Version 2 goals

• Easily build into kernel
• Easy to control
• Can reliably turn tracing on and off
• No kernel rewrite
devtrace

Devtrace in a nutshell

• Plan 9 style text control
• Textual output
• No kernel rewriting
• Tracing on/off is always safe
• Logic analyzer style interface
• Not as powerful as dtrace
• Not as informative as ftrace (I think?)
• Ftrace info can be added
• Could produce dtrace format data for dtrace function processing
devtrace

tructure

• When you invoke ?l with -p, functions look like this:

f       0x00001020 CALL _profin(SB)
f+0x5   0x00001025 MOVL a+0x0(FP),AX
f+0x9   0x00001029 ADDL $0x5,AX
f+0xc   0x0000102c CALL _profout(SB)
f+0x11  0x00001031 RET

• profin/out give you arbitrary hooks
• Call sequence only lets you see the pc, no args
• Just gets a histogram, no time relationships


Ronald G. Minnich, John Floren, Aki Nyrhinen 


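devtrace

our own profin

TEXT _profin(SB), 1, $0
	TESTL	probeactive(SB), AX
	JZ	inotready		/* skip the hook when the test yields zero */
	MOVL	4(SP), AX
	PUSHL	AX
	MOVL	4(SP), AX		/* the push moved SP, so this reads the slot that was 0(SP) */
	PUSHL	AX
	CALL	profin(SB)		/* C-level hook */
	POPL	AX
	POPL	AX
inotready:
	RET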
devtrace

TEXT _profout(SB), 1, $0
	PUSHL	AX			/* save AX, which holds the return value */
	TESTL	probeactive(SB), AX
	JZ	notready
	MOVL	4(SP), AX		/* return address of this call */
	PUSHL	AX
	CALL	profout(SB)		/* C-level hook */
	POPL	AX
notready:
	POPL	AX			/* restore the saved AX */
	RET
devtrace

• The stack frame already has some things you want
• Caller PC and some args
• Also, on some machines, one register has the "first parameter"
• Problem is to get them into a machine-independent format
• On x86, can trash AX on entry; must save it on return
• On PPC, must save it in both directions
• Finally, it's important to disable tracing on certain functions
• Such as the profin assembly and C code
• And anything the profin C code calls
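As an illustration of a "machine-independent format", here is a minimal C sketch of what the C side of an entry/exit hook might record. Everything in it is an assumption made for the example: the `Rec` layout, the ring size `NREC`, and the names `tracelog` and `tracepeek` are hypothetical and are not the code in 9.probe. Only the idea of a `probeactive` flag is taken from the assembly stubs.

```c
#include <assert.h>
#include <stdint.h>

enum { NREC = 8 };   /* hypothetical ring size; must be a power of two */

/* Hypothetical machine-independent record. */
typedef struct {
    char kind;        /* 'E' entry or 'X' exit */
    uintptr_t pc;     /* PC of the traced function */
    uint64_t ticks;   /* timestamp supplied by the caller */
    uintptr_t arg;    /* first argument on entry, return value on exit */
} Rec;

static Rec ring[NREC];
static unsigned long nlogged;  /* total records ever written */
int probeactive;               /* stands in for the flag tested by the stubs */

/* Append one record, overwriting the oldest once the ring is full. */
void
tracelog(char kind, uintptr_t pc, uint64_t ticks, uintptr_t arg)
{
    Rec *r;

    assert((NREC & (NREC - 1)) == 0);
    if (!probeactive)
        return;
    r = &ring[nlogged++ & (NREC - 1)];
    r->kind = kind;
    r->pc = pc;
    r->ticks = ticks;
    r->arg = arg;
}

/* Return the i-th oldest record still held, or 0 if out of range. */
const Rec *
tracepeek(unsigned long i)
{
    unsigned long held = nlogged < NREC ? nlogged : NREC;

    if (i >= held)
        return 0;
    return &ring[(nlogged - held + i) & (NREC - 1)];
}
```

A hook like this, and anything it calls, is exactly the kind of code that must itself be left untraced; otherwise logging a record would trigger another call into the hook.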
Using it

ur kernel

• contrib/rminnich/9.probe
• Bind these directories over your /sys/src/9
• Note: I left the v1 code in there for your viewing pleasure
• mk 'CONF=pcprcpf'
• Boot the kernel and you're ready to try it out
• See the 'probeit' file in 9.probe

#!/bin/rc
nm pc/9pcprcpf | grep $1 |
	awk '{print "probe 0x" $1 " new "$3}' > /dev/probectl

• That script will set up tracing for one symbol at entry
• Example in 'probeit' shows a real trace
• Probing syschdir and namec
• Showing arguments and so on
E f01b626e 000000226c00e7d4 00009ed0 00000000 00000000 (

• E or X
• PC
• Time in ticks
• PID
• Three args for E; return value for X
• Fixed, easy to parse format
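Because the format is fixed, a consumer can be very small. The following C sketch parses one line with `sscanf`. It assumes that every numeric field is hexadecimal and that an X line carries at least one value; the `TraceRec` struct and the name `parsetrace` are made up for the example and are not part of devtrace.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed trace line. Field names are illustrative assumptions. */
typedef struct {
    char kind;                 /* 'E' or 'X' */
    unsigned long pc;
    unsigned long long ticks;
    unsigned long pid;
    unsigned long val[3];      /* args for E; val[0] is the return value for X */
    int nval;                  /* how many of val[] were present */
} TraceRec;

/*
 * Parse "K pc ticks pid v0 [v1 v2]" with all numbers in hex.
 * Returns 0 on success, -1 if the line does not match.
 */
int
parsetrace(const char *line, TraceRec *r)
{
    int n;

    assert(line != NULL && r != NULL);
    r->val[0] = r->val[1] = r->val[2] = 0;
    n = sscanf(line, " %c %lx %llx %lx %lx %lx %lx",
        &r->kind, &r->pc, &r->ticks, &r->pid,
        &r->val[0], &r->val[1], &r->val[2]);
    if (n < 5 || (r->kind != 'E' && r->kind != 'X'))
        return -1;
    r->nval = n - 4;
    return 0;
}
```

Reading lines with `fgets` and passing each to `parsetrace` would be enough to feed a post-processing tool such as a call-tree builder.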
Viewing it

[Figure: view of a trace, labeled with kernel function names. Legible labels include freepte, closergrp, closefgrp, sysclose, syspread, read, snprint, sysopen, namec, devopen, walk, cclose, chantree, pathclose, addelem, uniquepath, copypath, ewalk, devwalk, devgen, parsename, free, poolfree, newpath, sprint, vsnprint, dofmt, fmtdispatch, syscall, syspwrite, write, allocb, poolmsize, mallocz, i8250interrupt, i8250kick, duppage, copypage, memmove, ptealloc and poolalloc; the remaining labels are illegible.]
Plan 9 trace device

Summary

• We were able to measure where the time was spent
• There were some real time-wasters (incref/decref)
• There were some problems it is hard to see a way around (okaddr, fdtochan)
• We could go to elaborate and complex translation and other caching to try to shorten the time
• But that seems the wrong path
• devtrace can let you see where the time is going
• Simple textual control and data interface
• You can see relationships between calls
• Exploits existing profiling architecture
• Thanks to SP9SSS for help and advice
Summary

it less intrusive?

• Yes. And safer.
• Consider this code:

a:	JMP 2f
	call profin
2:

• It becomes:

EB05		a:	jmp 2f
E8F9FFFFFF		call profin
C3

• So, actually, we can change one byte and enable/disable profiling on this function
	ret
	call profout
	ret

• Same deal: NOP and RET are the same size, one byte
• So a one-byte change can enable/disable profout in this function
• And it's easy to find the code signature!
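The one-byte switch can be sketched in C on a plain byte buffer (not on live kernel text). The slides do not say which byte of the entry sequence changes; this sketch assumes it is the displacement of the short jump, so that `EB 05` skips the 5-byte call and `EB 00` falls through into it. The function names `setentry` and `setexit` are hypothetical. The opcode values are the standard x86 encodings: `EB` for JMP rel8, `E8` for CALL rel32, `C3` for RET and `90` for NOP.

```c
#include <assert.h>
#include <stddef.h>

enum {
    OP_JMP8 = 0xEB,  /* JMP rel8 */
    OP_CALL = 0xE8,  /* CALL rel32 */
    OP_RET  = 0xC3,
    OP_NOP  = 0x90
};

/*
 * Entry pattern: EB dd E8 xx xx xx xx. With dd = 05 the jump skips
 * the 5-byte call; with dd = 00 it falls through into the call.
 * Returns 0 on success, -1 if the signature does not match.
 */
int
setentry(unsigned char *code, size_t len, int enable)
{
    assert(code != NULL);
    if (len < 7 || code[0] != OP_JMP8 || code[2] != OP_CALL)
        return -1;
    if (code[1] != 0x00 && code[1] != 0x05)
        return -1;
    code[1] = enable ? 0x00 : 0x05;
    return 0;
}

/*
 * Exit pattern: (C3|90) E8 xx xx xx xx C3. A leading RET returns
 * before the call; a leading NOP lets execution reach it.
 */
int
setexit(unsigned char *code, size_t len, int enable)
{
    assert(code != NULL);
    if (len < 7 || code[1] != OP_CALL || code[6] != OP_RET)
        return -1;
    if (code[0] != OP_RET && code[0] != OP_NOP)
        return -1;
    code[0] = enable ? OP_NOP : OP_RET;
    return 0;
}
```

Both functions check the byte signature before writing, which is the "easy to find the code signature" property: a patcher can refuse to touch anything that does not look like an instrumented entry or exit.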
• 8l builds "instructions" (prg()) as part of creating the linked binary
• They form a linked list
• If profiling is enabled, 8l does a final-pass walk of the list and inserts calls to profin/profout on function entry/exit
• How do you know what is entry/exit?
  • The prg() struct is marked as such
• So, given this list, we need only modify how code is inserted
• In the code shown below, the current entry/exit is pointed to by 'p'
• Copy doprof() in obj.c to doprof2

	q = prg();
	q2 = prg();
	q->line = p->line; q->pc = p->pc;
	q->link = p->link; p->link = q2;
	q2->link = q;
	q2->line = p->line; q2->pc = p->pc;
	q2->as = AJMP; q2->to.type = D_BRANCH;
	q2->to.sym = p->to.sym; q2->pcond = q->link;
	p = q;
	p->as = ACALL; p->to.type = D_BRANCH;
	p->pcond = ps2; p->to.sym = s2;
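The list surgery above can be tried outside the linker. This is a minimal, self-contained C sketch with a much-reduced stand-in for the instruction struct; the `Prog` fields kept here, the opcode enum and the helper names `newprog` and `addentryhook` are assumptions for illustration and are not 8l's definitions. It splices a jump and a call after the entry instruction so that, by default, the jump skips the call.

```c
#include <assert.h>
#include <stdlib.h>

enum { ATEXT, AJMP, ACALL, AOTHER };  /* hypothetical opcode subset */

/* Hypothetical, much-reduced stand-in for the linker's instruction struct. */
typedef struct Prog Prog;
struct Prog {
    int as;        /* opcode */
    Prog *link;    /* next instruction in the list */
    Prog *pcond;   /* branch target, if any */
};

static Prog *
newprog(int as)
{
    Prog *p = calloc(1, sizeof *p);

    assert(p != NULL);
    p->as = as;
    return p;
}

/*
 * After the entry instruction p, splice in JMP-over-CALL:
 *     p -> JMP (to old p->link) -> CALL hook -> old p->link
 * Returns the new CALL node, mirroring "p = q" in the slide.
 */
Prog *
addentryhook(Prog *p, Prog *hook)
{
    Prog *call = newprog(ACALL);
    Prog *jmp = newprog(AJMP);

    call->link = p->link;
    call->pcond = hook;
    jmp->link = call;
    jmp->pcond = call->link;   /* skip the call by default */
    p->link = jmp;
    return call;
}
```

Returning the call node lets a list walk continue from the inserted instruction rather than revisiting it.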
Summary

Return case

	/*
	 * RET
	 */
	q = prg();
	q->as = ARET; q->from = p->from; q->to = p->to; q->l...
	/*
	 * JAL profout
	 */
	p->as = ACALL; p->from = zprg.from; p->t...
	p = q;
Appendix

For Further Reading I

[1] Dynamic instrumentation of production systems.
[2] Kernel korner: kprobes - a kernel debugger.
[3] Hardware profiling of kernels, or: How to look under the hood while the engine is running.
[4] Djprobes status.
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